# DFA

This repository contains implementations of the Dual-Feedback Actor (DFA), Soft Actor-Critic (SAC), and Zeroth-Order Policy Gradient methods.

## Overview

The project implements and compares different approaches to reinforcement learning from human feedback (RLHF):

1. **DFA**: A combination of Soft Actor-Critic with Direct Preference Optimization for continuous control tasks
2. **Zeroth-Order Policy Gradient**: Implementation of "Zeroth-Order Policy Gradient for RLHF without Reward Inference" on a stochastic GridWorld environment


## Files

- `train_full_parallel.py`: Implementation of SAC and DFA algorithms with parallel processing optimizations for continuous control environments
- `rm_test.py`: Implementation of various RLHF algorithms for a 5×5 stochastic GridWorld environment, including:
  - ZPG (Zeroth-Order Policy Gradient)
  - ZBCPG (Zeroth-Order Block Coordinate Policy Gradient)
  - RMPPO (Reward Model + PPO)
  - DFA (ours)
  - Oracle PPO (using true rewards)

## Requirements

```
gymnasium[mujoco]
stable-baselines3
torch
numpy
matplotlib
```

## Installation

1. Clone this repository
2. Install the required packages:
   ```
   pip install -r req.txt
   ```

## Usage

### DFA for Continuous Control

To run the DFA algorithm on continuous control tasks:

```bash
python train_full_parallel.py
```

You can modify the algorithm choice and hyperparameters in the main section of the script:

```python
# Choose which algorithm to run
algorithm = "dfa"  # Options: "sac", "dfa", "trpo_sb3"

# Common hyperparameters
gamma = 0.99
env_name = "Pendulum-v1"  # Or "MountainCarContinuous-v0"
```

### RLHF Algorithms on GridWorld

To run the RLHF algorithms on the GridWorld environment:

```bash
python rm_test.py
```

The main section of the script allows you to choose which algorithms to run by uncommenting the relevant sections.
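
For orientation, the zeroth-order idea behind ZPG can be illustrated with a generic two-point gradient estimator. This is a minimal sketch on a toy objective, not the code in `rm_test.py`: the names `zeroth_order_gradient` and `toy_objective` are hypothetical, and the sketch queries the objective directly, whereas a preference-based method would have to estimate the value difference from feedback.

```python
import random


def zeroth_order_gradient(f, theta, mu=1e-2, num_samples=200, rng=None):
    """Estimate the gradient of f at theta from function values only."""
    rng = rng or random.Random(0)
    dim = len(theta)
    grad = [0.0] * dim
    for _ in range(num_samples):
        # Random Gaussian perturbation direction.
        u = [rng.gauss(0.0, 1.0) for _ in range(dim)]
        plus = [t + mu * ui for t, ui in zip(theta, u)]
        minus = [t - mu * ui for t, ui in zip(theta, u)]
        # Central finite difference along u.
        scale = (f(plus) - f(minus)) / (2.0 * mu)
        for i in range(dim):
            grad[i] += scale * u[i] / num_samples
    return grad


def toy_objective(theta):
    # Hypothetical stand-in for a policy's value; maximized at theta = (1, 1).
    return -sum((t - 1.0) ** 2 for t in theta)


estimate = zeroth_order_gradient(toy_objective, [0.0, 0.0], num_samples=2000)
print(estimate)  # should be close to the true gradient [2.0, 2.0]
```

Averaging over more perturbation directions reduces the variance of the estimate at the cost of more objective queries.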

## Key Features

- **Parallel Processing**: The DFA implementation uses PyTorch's parallel processing capabilities for improved performance
- **Wandb Integration**: Both implementations include Weights & Biases logging for experiment tracking
- **Multiple Algorithms**: Implementations of several RLHF algorithms (ZPG, ZBCPG, RMPPO, DFA) plus an Oracle PPO baseline for side-by-side comparison
- **Simulated Human Feedback**: The GridWorld implementation includes simulated human feedback models (Bradley-Terry and Weibull)
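
As an illustration of the Bradley-Terry feedback model mentioned above, the following sketch samples a preference label from the returns of two trajectories. It assumes the annotator compares scalar returns; the function names are hypothetical and the repository's own implementation may differ.

```python
import math
import random


def bradley_terry_prob(return_a, return_b):
    """Probability that trajectory A is preferred over B (Bradley-Terry)."""
    x = return_a - return_b
    # Numerically stable sigmoid of the return difference.
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


def sample_preference(return_a, return_b, rng):
    """Return 1 if the simulated annotator prefers A, else 0."""
    return 1 if rng.random() < bradley_terry_prob(return_a, return_b) else 0


rng = random.Random(0)
labels = [sample_preference(2.0, 1.0, rng) for _ in range(1000)]
print(bradley_terry_prob(2.0, 1.0), sum(labels) / len(labels))
```

Equal returns give a probability of 0.5, so the simulated annotator becomes noisier as the two trajectories become harder to tell apart.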

## Algorithm Details

### DFA

DFA combines Soft Actor-Critic with Direct Preference Optimization to learn from human preferences in continuous control tasks. The implementation includes:

- Replay buffer for experience storage
- Q-networks for value estimation
- Policy network for action selection
- DFA loss for preference optimization
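
To give a sense of the preference-optimization component, here is a minimal sketch of the standard Direct Preference Optimization loss for a single preference pair. It is not the DFA loss from `train_full_parallel.py`, which may differ; the function names and the default `beta` are assumptions made for illustration.

```python
import math


def log_sigmoid(x):
    """Numerically stable log(sigmoid(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))


def dpo_style_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """DPO loss for one (preferred, rejected) pair.

    logp_w / logp_l: policy log-probabilities of the preferred / rejected sample.
    ref_logp_w / ref_logp_l: the same quantities under a fixed reference policy.
    """
    margin = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    return -log_sigmoid(margin)


# When the policy equals the reference, the margin is zero and the loss is log(2).
print(dpo_style_loss(-1.0, -2.0, -1.0, -2.0))
# Raising the preferred sample's log-probability lowers the loss.
print(dpo_style_loss(-0.5, -2.0, -1.0, -2.0))
```

In a PyTorch training loop the same expression would typically be written with tensors and averaged over a batch of preference pairs.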



## License

[Specify your license here]

## Citation

If you use this code in your research, please cite:

[Add citation information]
